Problem Statement

The WHO wants a data-driven approach that can suggest which area a country should prioritize in order to efficiently improve the life expectancy of its population. The data is in the file Life_Expectancy_Data.csv.

Solution

We build a prediction engine that predicts life expectancy from features such as the status of the country, GDP, alcohol consumption, and adult mortality rate. The prediction engine uses a linear regression model.

Purpose: To perform regression analysis to study life expectancy in different countries

Steps for applying linear regression to the above business problem:

  1. Loading the Data
  2. Understanding the Data
  3. Data Preprocessing
  4. Exploratory Data Analysis
  5. Model Building
  6. Model Diagnostics
  7. Predictions and Evaluations

Importing Libraries: Setting up the environment

  1. NumPy and pandas are used for data loading and data manipulation.
  2. Seaborn and Matplotlib are used for data visualization.
  3. scikit-learn and statsmodels are used for data preprocessing and model building.
In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows',800)
pd.set_option('display.max_columns',500)

import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline 

# importing all libraries and dependencies for machine learning
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import mean_absolute_error, mean_squared_error,r2_score
import random

1. Loading the data

In [2]:
df = pd.read_csv("~/data/Life_Expectancy_Data.csv")

2. Understanding the data

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15   HIV/AIDS                        2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18   thinness  1-19 years            2904 non-null   float64
 19   thinness 5-9 years              2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB
In [4]:
df.describe()
Out[4]:
Year Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles BMI under-five deaths Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness 1-19 years thinness 5-9 years Income composition of resources Schooling
count 2938.000000 2928.000000 2928.000000 2938.000000 2744.000000 2938.000000 2385.000000 2938.000000 2904.000000 2938.000000 2919.000000 2712.00000 2919.000000 2938.000000 2490.000000 2.286000e+03 2904.000000 2904.000000 2771.000000 2775.000000
mean 2007.518720 69.224932 164.796448 30.303948 4.602861 738.251295 80.940461 2419.592240 38.321247 42.035739 82.550188 5.93819 82.324084 1.742103 7483.158469 1.275338e+07 4.839704 4.870317 0.627551 11.992793
std 4.613841 9.523867 124.292079 117.926501 4.052413 1987.914858 25.070016 11467.272489 20.044034 160.445548 23.428046 2.49832 23.716912 5.077785 14270.169342 6.101210e+07 4.420195 4.508882 0.210904 3.358920
min 2000.000000 36.300000 1.000000 0.000000 0.010000 0.000000 1.000000 0.000000 1.000000 0.000000 3.000000 0.37000 2.000000 0.100000 1.681350 3.400000e+01 0.100000 0.100000 0.000000 0.000000
25% 2004.000000 63.100000 74.000000 0.000000 0.877500 4.685343 77.000000 0.000000 19.300000 0.000000 78.000000 4.26000 78.000000 0.100000 463.935626 1.957932e+05 1.600000 1.500000 0.493000 10.100000
50% 2008.000000 72.100000 144.000000 3.000000 3.755000 64.912906 92.000000 17.000000 43.500000 4.000000 93.000000 5.75500 93.000000 0.100000 1766.947595 1.386542e+06 3.300000 3.300000 0.677000 12.300000
75% 2012.000000 75.700000 228.000000 22.000000 7.702500 441.534144 97.000000 360.250000 56.200000 28.000000 97.000000 7.49250 97.000000 0.800000 5910.806335 7.420359e+06 7.200000 7.200000 0.779000 14.300000
max 2015.000000 89.000000 723.000000 1800.000000 17.870000 19479.911610 99.000000 212183.000000 87.300000 2500.000000 99.000000 17.60000 99.000000 50.600000 119172.741800 1.293859e+09 27.700000 28.600000 0.948000 20.700000
In [5]:
df.head(5)
Out[5]:
Country Year Status Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles BMI under-five deaths Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness 1-19 years thinness 5-9 years Income composition of resources Schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 19.1 83 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 18.6 86 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 18.1 89 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 17.6 93 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 17.2 97 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5
In [6]:
num_col = df.select_dtypes(include=np.number).columns
print("Numerical columns are as follows.\n",num_col)
cat_col = df.select_dtypes(exclude=np.number).columns
print("Categorical columns are as follows.\n",cat_col)
Numerical columns are as follows.
 Index(['Year', 'Life expectancy ', 'Adult Mortality', 'infant deaths',
       'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ',
       'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ',
       ' HIV/AIDS', 'GDP', 'Population', ' thinness  1-19 years',
       ' thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')
Categorical columns are as follows.
 Index(['Country', 'Status'], dtype='object')

3. Data Preprocessing

In [7]:
# Remove extra whitespace from the column names.
# str.strip() removes leading and trailing spaces, if any.
# rename() applies the lambda to every column name.
df = df.rename(columns = lambda x : x.strip())
In [8]:
# Convert the categorical 'Status' feature to a numeric one so the model can use it.
from sklearn import preprocessing
# LabelEncoder maps each distinct label to an integer, in sorted label order.
label_encoder = preprocessing.LabelEncoder()
# Status "Developing" will be 1 and "Developed" will be 0
df['Status'] = label_encoder.fit_transform(df['Status'])
df.head()
Out[8]:
Country Year Status Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles BMI under-five deaths Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness 1-19 years thinness 5-9 years Income composition of resources Schooling
0 Afghanistan 2015 1 65.0 263.0 62 0.01 71.279624 65.0 1154 19.1 83 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 1 59.9 271.0 64 0.01 73.523582 62.0 492 18.6 86 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 1 59.9 268.0 66 0.01 73.219243 64.0 430 18.1 89 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 1 59.5 272.0 69 0.01 78.184215 67.0 2787 17.6 93 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 1 59.2 275.0 71 0.01 7.097109 68.0 3013 17.2 97 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5
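
The 0/1 codes above follow from LabelEncoder sorting the labels alphabetically. As a minimal sketch on toy data (the frame `toy` and the dictionary `status_codes` are illustrative names, not part of the notebook), the same encoding can be written as an explicit mapping, so the codes do not depend on label order:

```python
import pandas as pd

# Toy stand-in for the 'Status' column (illustrative data, not the real file).
toy = pd.DataFrame({"Status": ["Developing", "Developed", "Developing"]})

# Explicit mapping; it matches the codes LabelEncoder gives these two labels.
status_codes = {"Developed": 0, "Developing": 1}
toy["Status_code"] = toy["Status"].map(status_codes)

# A label missing from the mapping would become NaN, which makes typos visible.
print(toy)
```

With only two categories either approach gives the same column; the explicit dictionary mainly documents the meaning of 0 and 1.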
In [9]:
# Count the missing values in each column
print(df.isna().sum())
print(df.shape)
Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
BMI                                 34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
HIV/AIDS                             0
GDP                                448
Population                         652
thinness  1-19 years                34
thinness 5-9 years                  34
Income composition of resources    167
Schooling                          163
dtype: int64
(2938, 22)
In [10]:
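# Treating the NA values
# Replace NA values with the mean of the column.
# Assign the result back: calling fillna(inplace=True) on a column selection
# is deprecated in recent pandas versions and may not modify df.
for i in df.columns.drop('Country'):
    df[i] = df[i].fillna(df[i].mean())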
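
To make the effect of mean imputation concrete, here is a small sketch on a toy frame (the frame and its values are made up for illustration). It fills all numeric columns in one step and assigns the result back instead of modifying a column selection in place:

```python
import numpy as np
import pandas as pd

# Toy data with one gap per numeric column (values are illustrative).
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "GDP": [100.0, np.nan, 300.0, 500.0],
    "Schooling": [10.0, 12.0, np.nan, 14.0],
})

numeric_cols = toy.select_dtypes(include=np.number).columns
# Each NaN is replaced by the mean of the non-missing values in its own column.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
print(toy)
```

Note that the means here, as in the notebook, are computed over all rows. A stricter workflow would compute them on the training split only, so that no information from the test rows enters the imputation.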

4. Exploratory Data Analysis

In [11]:
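# Checking the distribution of the Y variable.
plt.figure(figsize=(8,8),dpi=80)
# Pass the series as x explicitly; newer seaborn versions treat the first positional argument as `data`.
sns.boxplot(x=df['Life expectancy'])
plt.title('Life Exp Box plot')
plt.show()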
In [12]:
plt.figure(figsize=(8,8))
plt.title('Life Exp Distribution plot')
# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the closest replacement.
sns.histplot(df['Life expectancy'], kde=True, stat='density')
plt.show()

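Summary: The Y variable has a few outliers and appears roughly normally distributed. Strictly speaking, linear regression assumes normally distributed residuals rather than a normally distributed Y, so this is an encouraging first check and not a confirmation of the assumption.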
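
The visual impression can be backed by two numbers, skewness and excess kurtosis. The following is a minimal sketch on synthetic data (the series `y` is generated here and only stands in for the real target):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, roughly bell-shaped stand-in for the target (illustrative only).
y = pd.Series(rng.normal(loc=70.0, scale=9.5, size=5000))

skewness = y.skew()         # close to 0 for a symmetric distribution
excess_kurtosis = y.kurt()  # pandas reports excess kurtosis: close to 0 for a normal distribution
print(f"skew={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}")
```

On the notebook's data the same check would be `df['Life expectancy'].skew()` and `df['Life expectancy'].kurt()`; values far from zero would suggest the bell-shape reading is too optimistic.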

In [13]:
num_col = df.select_dtypes(include = np.number).columns
cat_col = df.select_dtypes(exclude = np.number).columns
print('Numerical columns are ',num_col)
print('categorical columns are ',cat_col)
Numerical columns are  Index(['Year', 'Status', 'Life expectancy', 'Adult Mortality', 'infant deaths',
       'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles', 'BMI',
       'under-five deaths', 'Polio', 'Total expenditure', 'Diphtheria',
       'HIV/AIDS', 'GDP', 'Population', 'thinness  1-19 years',
       'thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')
categorical columns are  Index(['Country'], dtype='object')
In [14]:
# Checking for multicollinearity among the features using the correlation matrix
plt.figure(figsize=(16,16))
p = sns.heatmap(df[num_col].corr(), annot = True, cmap = 'RdYlGn', center = 0)
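
A heatmap of this size is easier to act on when the strongly correlated pairs are also listed explicitly. The sketch below does this on synthetic columns; the column names and the 0.8 cutoff are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "infant": base,
    "under_five": base + rng.normal(scale=0.05, size=200),  # nearly a copy of 'infant'
    "alcohol": rng.normal(size=200),                         # unrelated column
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
high_pairs = pairs[pairs > 0.8]  # 0.8 is an assumed cutoff
print(high_pairs)
```

Applied to the notebook, `toy` would be replaced by `df[num_col]`.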
In [15]:
# To know the relation between different features
ax = sns.pairplot(df[num_col])

5. Model Building

In [16]:
# Train test Split (70% data - Train, 30% data - Test)
X= df.drop(columns=['Life expectancy','Country'])
y=df[['Life expectancy']]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=1234)

Approach 1: Adding 1 variable at a time.

Building model with 1 variable

In [17]:
X_train1 = X_train['Income composition of resources']
In [18]:
# Adding a constant
X_train1 = sm.add_constant(X_train1)

# Creating an OLS model

model_1 = sm.OLS(y_train, X_train1).fit()
In [19]:
model_1.params
Out[19]:
const                              48.440947
Income composition of resources    33.059741
dtype: float64
In [20]:
print(model_1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.490
Model:                            OLS   Adj. R-squared:                  0.490
Method:                 Least Squares   F-statistic:                     1974.
Date:                Tue, 22 Sep 2020   Prob (F-statistic):          1.09e-302
Time:                        22:18:19   Log-Likelihood:                -6894.3
No. Observations:                2056   AIC:                         1.379e+04
Df Residuals:                    2054   BIC:                         1.380e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              48.4409      0.492     98.412      0.000      47.476      49.406
Income composition of resources    33.0597      0.744     44.427      0.000      31.600      34.519
==============================================================================
Omnibus:                      138.959   Durbin-Watson:                   2.047
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              617.560
Skew:                           0.121   Prob(JB):                    7.92e-135
Kurtosis:                       5.674   Cond. No.                         6.86
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
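
To show what the fit above computes, here is a minimal sketch of ordinary least squares with plain NumPy on synthetic data. The true intercept and slope (48 and 33) are picked only to resemble the fitted values above; the data itself is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=500)                    # synthetic predictor
y = 48.0 + 33.0 * x + rng.normal(scale=2.0, size=500)  # synthetic target with noise

# Design matrix with a leading column of ones, the role sm.add_constant plays.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X @ beta ~ y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

# R-squared: share of the variance in y explained by the fitted line.
residuals = y - X @ beta
r_squared = 1.0 - residuals.var() / y.var()
print(intercept, slope, r_squared)
```

Read this way, the coefficient of 33.06 in the summary means that an increase of 0.1 in 'Income composition of resources' is associated with roughly 3.3 more years of life expectancy in this one-variable model.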

Building model with 2 variables

In [21]:
#Adding Schooling feature to the regression model
X_train2 = X_train[['Income composition of resources','Schooling']]
In [22]:
#Adding constant
X_train2 = sm.add_constant(X_train2)
In [23]:
model_2 = sm.OLS(y_train, X_train2).fit()
In [24]:
model_2.params
Out[24]:
const                              43.145928
Income composition of resources    16.273079
Schooling                           1.320315
dtype: float64
In [25]:
print(model_2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.562
Model:                            OLS   Adj. R-squared:                  0.561
Method:                 Least Squares   F-statistic:                     1316.
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:19   Log-Likelihood:                -6738.3
No. Observations:                2056   AIC:                         1.348e+04
Df Residuals:                    2053   BIC:                         1.350e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              43.1459      0.540     79.895      0.000      42.087      44.205
Income composition of resources    16.2731      1.146     14.197      0.000      14.025      18.521
Schooling                           1.3203      0.072     18.340      0.000       1.179       1.461
==============================================================================
Omnibus:                      182.792   Durbin-Watson:                   2.037
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              596.381
Skew:                          -0.427   Prob(JB):                    3.14e-130
Kurtosis:                       5.497   Cond. No.                         101.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
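
When comparing models with different numbers of predictors, adjusted R-squared is the fairer figure because it penalizes each added variable. A small sketch of the standard formula (the function name `adjusted_r2` is an illustrative choice), applied to the rounded R-squared values printed in the first two summaries:

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Rounded values as printed in the two summaries (2056 observations each).
print(adjusted_r2(0.490, 2056, 1))
print(adjusted_r2(0.562, 2056, 2))
```

Because the inputs are already rounded to three decimals, the results can differ from the printed 'Adj. R-squared' in the last digit.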

Building model with 3 variables

In [26]:
# adding 3rd feature in regression model 
X_train3 = X_train[['Income composition of resources', 'Schooling', 'Adult Mortality']]
In [27]:
#adding a constant
X_train3 = sm.add_constant(X_train3)
In [28]:
#creating fitted model
model_3 = sm.OLS(y_train, X_train3).fit()
In [29]:
#checking the parameters
model_3.params
Out[29]:
const                              56.227689
Income composition of resources    10.637516
Schooling                           1.003654
Adult Mortality                    -0.034790
dtype: float64
In [30]:
#Printing the model summary
print(model_3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.721
Model:                            OLS   Adj. R-squared:                  0.720
Method:                 Least Squares   F-statistic:                     1765.
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:19   Log-Likelihood:                -6275.3
No. Observations:                2056   AIC:                         1.256e+04
Df Residuals:                    2052   BIC:                         1.258e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              56.2277      0.577     97.502      0.000      55.097      57.359
Income composition of resources    10.6375      0.930     11.438      0.000       8.814      12.461
Schooling                           1.0037      0.058     17.236      0.000       0.889       1.118
Adult Mortality                    -0.0348      0.001    -34.168      0.000      -0.037      -0.033
==============================================================================
Omnibus:                      379.309   Durbin-Watson:                   1.962
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1628.478
Skew:                          -0.829   Prob(JB):                         0.00
Kurtosis:                       7.032   Cond. No.                     1.72e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.72e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
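
A large condition number does not have to come from collinearity alone; features measured on very different scales also inflate it. The sketch below illustrates this on synthetic data (the two columns only imitate an index between 0 and 1 and a rate in the hundreds):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
index_like = rng.uniform(0.0, 1.0, size=n)        # values between 0 and 1
mortality_like = rng.uniform(1.0, 700.0, size=n)  # values in the hundreds

def with_constant(*cols):
    # Design matrix with a leading column of ones.
    return np.column_stack([np.ones(n), *cols])

def standardize(col):
    # Zero mean, unit standard deviation.
    return (col - col.mean()) / col.std()

cond_raw = np.linalg.cond(with_constant(index_like, mortality_like))
cond_scaled = np.linalg.cond(
    with_constant(standardize(index_like), standardize(mortality_like))
)
print(cond_raw, cond_scaled)
```

The two synthetic columns are independent, so the drop in condition number after standardizing comes from scale alone. Whether scale or collinearity dominates in the notebook's model would need the same check on the real columns.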

Approach 2: Recursive Feature Elimination (RFE), followed by manual elimination using p-values and VIF

In [31]:
# Running RFE to select the 15 most important features
lm = LinearRegression()
lm.fit(X_train,y_train)
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select=15)
rfe = rfe.fit(X_train, y_train)
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
In [32]:
list(zip(X_train.columns, rfe.support_,rfe.ranking_))
Out[32]:
[('Year', False, 2),
 ('Status', True, 1),
 ('Adult Mortality', True, 1),
 ('infant deaths', True, 1),
 ('Alcohol', True, 1),
 ('percentage expenditure', False, 3),
 ('Hepatitis B', True, 1),
 ('Measles', False, 5),
 ('BMI', True, 1),
 ('under-five deaths', True, 1),
 ('Polio', True, 1),
 ('Total expenditure', True, 1),
 ('Diphtheria', True, 1),
 ('HIV/AIDS', True, 1),
 ('GDP', False, 4),
 ('Population', False, 6),
 ('thinness  1-19 years', True, 1),
 ('thinness 5-9 years', True, 1),
 ('Income composition of resources', True, 1),
 ('Schooling', True, 1)]
In [33]:
# Selecting the important features based on the RFE support mask
imp_columns = X_train.columns[rfe.support_]
In [34]:
print(imp_columns)
Index(['Status', 'Adult Mortality', 'infant deaths', 'Alcohol', 'Hepatitis B',
       'BMI', 'under-five deaths', 'Polio', 'Total expenditure', 'Diphtheria',
       'HIV/AIDS', 'thinness  1-19 years', 'thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')
In [35]:
#creating X_train dataframe with RFE selected variables 
X_train_rfe = X_train[imp_columns]

After RFE selects the columns, we manually evaluate each model's p-values and VIF values. Until all p-values and VIFs fall within an acceptable range, we drop variables one at a time based on the criteria below.

  • High VIF High p-value : Drop the variable
  • Low VIF High p-value : Drop the variable with high p-value first
  • Low VIF Low p-value : Accept the variable
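
The criteria above can be sketched as a small helper that picks the next feature to drop. This is one possible reading, not the exact procedure followed below: the cutoffs (0.05 for p-values, 5 for VIF), the function name, and the choice to handle high p-values before high VIFs are all assumptions, and the case "high VIF, low p-value" is treated here as a drop candidate once no high p-values remain.

```python
from typing import Dict, Optional

def next_feature_to_drop(
    p_values: Dict[str, float],
    vifs: Dict[str, float],
    p_threshold: float = 0.05,   # assumed cutoff
    vif_threshold: float = 5.0,  # assumed cutoff
) -> Optional[str]:
    """Return the next feature to drop, or None when every feature passes."""
    high_p = {f: p for f, p in p_values.items() if p > p_threshold}
    if high_p:
        # Covers both "high p-value" rows of the criteria.
        return max(high_p, key=high_p.get)
    high_vif = {f: v for f, v in vifs.items() if v > vif_threshold}
    if high_vif:
        return max(high_vif, key=high_vif.get)
    return None

# Illustrative numbers only.
p_values = {"a": 0.001, "b": 0.40, "c": 0.002}
vifs = {"a": 12.0, "b": 3.0, "c": 2.0}
print(next_feature_to_drop(p_values, vifs))  # prints: b
```

After each drop the model has to be refitted and the p-values and VIFs recomputed, since both change when a variable leaves the model.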
In [36]:
random.seed(0)

#Add constant
X_train_rfec = sm.add_constant(X_train_rfe)

#model with RFE features
lm_rfe = sm.OLS(y_train, X_train_rfec).fit()

print(lm_rfe.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.820
Model:                            OLS   Adj. R-squared:                  0.819
Method:                 Least Squares   F-statistic:                     620.1
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:19   Log-Likelihood:                -5823.0
No. Observations:                2056   AIC:                         1.168e+04
Df Residuals:                    2040   BIC:                         1.177e+04
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              56.3507      0.815     69.162      0.000      54.753      57.949
Status                             -2.0823      0.313     -6.663      0.000      -2.695      -1.469
Adult Mortality                    -0.0200      0.001    -20.474      0.000      -0.022      -0.018
infant deaths                       0.0944      0.010      9.677      0.000       0.075       0.114
Alcohol                             0.0403      0.031      1.289      0.198      -0.021       0.102
Hepatitis B                        -0.0209      0.005     -4.303      0.000      -0.030      -0.011
BMI                                 0.0488      0.006      7.963      0.000       0.037       0.061
under-five deaths                  -0.0714      0.007     -9.983      0.000      -0.085      -0.057
Polio                               0.0272      0.005      4.962      0.000       0.016       0.038
Total expenditure                   0.0768      0.042      1.835      0.067      -0.005       0.159
Diphtheria                          0.0456      0.006      7.807      0.000       0.034       0.057
HIV/AIDS                           -0.4970      0.024    -20.592      0.000      -0.544      -0.450
thinness  1-19 years               -0.0739      0.061     -1.212      0.226      -0.193       0.046
thinness 5-9 years                  0.0032      0.060      0.054      0.957      -0.114       0.120
Income composition of resources     6.4836      0.769      8.435      0.000       4.976       7.991
Schooling                           0.6908      0.051     13.544      0.000       0.591       0.791
==============================================================================
Omnibus:                      110.684   Durbin-Watson:                   1.979
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              321.018
Skew:                          -0.238   Prob(JB):                     1.96e-70
Kurtosis:                       4.876   Cond. No.                     2.45e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.45e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [37]:
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Out[37]:
Features VIF
6 under-five deaths 178.16
2 infant deaths 177.70
14 Schooling 44.59
13 Income composition of resources 30.42
9 Diphtheria 30.31
7 Polio 26.28
11 thinness 1-19 years 19.47
12 thinness 5-9 years 19.31
4 Hepatitis B 19.00
5 BMI 8.28
8 Total expenditure 7.74
0 Status 7.13
1 Adult Mortality 4.42
3 Alcohol 4.35
10 HIV/AIDS 1.70
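
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below implements this with NumPy on synthetic data and includes an intercept in each auxiliary regression. Note that statsmodels' `variance_inflation_factor` does not add an intercept on its own, so when it is called on a matrix without a constant column, as above, the values are not centered and can come out larger than the ones this sketch would give.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R2_j), regressing column j on the other columns plus an intercept."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
infant = rng.normal(loc=30.0, scale=10.0, size=500)
under_five = 1.4 * infant + rng.normal(scale=1.0, size=500)  # nearly a multiple of 'infant'
alcohol = rng.normal(loc=4.6, scale=4.0, size=500)            # unrelated column

values = vif(np.column_stack([infant, under_five, alcohol]))
print(dict(zip(["infant", "under_five", "alcohol"], values.round(2))))
```

The synthetic 'infant' and 'under_five' columns mimic the situation in the table above, where two nearly proportional features both receive very large VIFs while an unrelated column stays close to 1.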

Dropping the 'thinness 5-9 years' feature from the training dataset because it has a high p-value (0.957).

In [38]:
# Use the columns= keyword; the positional axis argument was removed in pandas 2.0
X_train_rfe1 = X_train_rfe.drop(columns=['thinness 5-9 years'])

#add constant
X_train_rfe1b = sm.add_constant(X_train_rfe1)

#building model
lm_rfe1 = sm.OLS(y_train, X_train_rfe1b).fit()

#summary
print(lm_rfe1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.820
Model:                            OLS   Adj. R-squared:                  0.819
Method:                 Least Squares   F-statistic:                     664.7
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:20   Log-Likelihood:                -5823.0
No. Observations:                2056   AIC:                         1.168e+04
Df Residuals:                    2041   BIC:                         1.176e+04
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              56.3537      0.813     69.336      0.000      54.760      57.948
Status                             -2.0819      0.312     -6.665      0.000      -2.695      -1.469
Adult Mortality                    -0.0200      0.001    -20.484      0.000      -0.022      -0.018
infant deaths                       0.0945      0.010      9.703      0.000       0.075       0.114
Alcohol                             0.0403      0.031      1.288      0.198      -0.021       0.102
Hepatitis B                        -0.0209      0.005     -4.304      0.000      -0.030      -0.011
BMI                                 0.0487      0.006      8.014      0.000       0.037       0.061
under-five deaths                  -0.0715      0.007    -10.003      0.000      -0.085      -0.057
Polio                               0.0272      0.005      4.963      0.000       0.016       0.038
Total expenditure                   0.0767      0.042      1.835      0.067      -0.005       0.159
Diphtheria                          0.0456      0.006      7.811      0.000       0.034       0.057
HIV/AIDS                           -0.4969      0.024    -20.599      0.000      -0.544      -0.450
thinness  1-19 years               -0.0710      0.029     -2.427      0.015      -0.128      -0.014
Income composition of resources     6.4837      0.768      8.437      0.000       4.977       7.991
Schooling                           0.6909      0.051     13.548      0.000       0.591       0.791
==============================================================================
Omnibus:                      110.689   Durbin-Watson:                   1.979
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              320.883
Skew:                          -0.238   Prob(JB):                     2.09e-70
Kurtosis:                       4.876   Cond. No.                     2.45e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.45e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [39]:
#creating dataframe that contains feature names and its VIF
vif = pd.DataFrame()
vif['Features'] = X_train_rfe1.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe1.values,i) for i in range(X_train_rfe1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif= vif.sort_values(by ='VIF', ascending = False)
vif
Out[39]:
Features VIF
6 under-five deaths 177.82
2 infant deaths 177.16
13 Schooling 44.55
12 Income composition of resources 30.42
9 Diphtheria 30.30
7 Polio 26.28
4 Hepatitis B 18.99
5 BMI 8.19
8 Total expenditure 7.74
0 Status 7.10
1 Adult Mortality 4.41
3 Alcohol 4.35
11 thinness 1-19 years 4.07
10 HIV/AIDS 1.70

'under-five deaths' has the highest VIF (177.82), so we drop that feature.

In [40]:
# Use the columns= keyword; the positional axis argument was removed in pandas 2.0
X_train_rfe2 = X_train_rfe1.drop(columns=['under-five deaths'])

#adding constant
X_train_rfe2c = sm.add_constant(X_train_rfe2)

#building model
lm_rfe2 = sm.OLS(y_train, X_train_rfe2c).fit()

#summary
print(lm_rfe2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.811
Model:                            OLS   Adj. R-squared:                  0.810
Method:                 Least Squares   F-statistic:                     675.4
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:20   Log-Likelihood:                -5872.2
No. Observations:                2056   AIC:                         1.177e+04
Df Residuals:                    2042   BIC:                         1.185e+04
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              55.1520      0.823     67.005      0.000      53.538      56.766
Status                             -2.0413      0.320     -6.382      0.000      -2.668      -1.414
Adult Mortality                    -0.0204      0.001    -20.368      0.000      -0.022      -0.018
infant deaths                      -0.0025      0.001     -2.797      0.005      -0.004      -0.001
Alcohol                            -0.0038      0.032     -0.121      0.904      -0.066       0.058
Hepatitis B                        -0.0243      0.005     -4.887      0.000      -0.034      -0.015
BMI                                 0.0502      0.006      8.071      0.000       0.038       0.062
Polio                               0.0307      0.006      5.484      0.000       0.020       0.042
Total expenditure                   0.0825      0.043      1.928      0.054      -0.001       0.166
Diphtheria                          0.0536      0.006      9.039      0.000       0.042       0.065
HIV/AIDS                           -0.5147      0.025    -20.893      0.000      -0.563      -0.466
thinness  1-19 years               -0.0578      0.030     -1.932      0.053      -0.116       0.001
Income composition of resources     7.1638      0.784      9.140      0.000       5.627       8.701
Schooling                           0.7020      0.052     13.448      0.000       0.600       0.804
==============================================================================
Omnibus:                      110.170   Durbin-Watson:                   1.988
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              309.575
Skew:                          -0.250   Prob(JB):                     5.98e-68
Kurtosis:                       4.834   Cond. No.                     2.30e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
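
As a quick sanity check on the summary above, the adjusted R-squared can be recomputed from the reported R-squared, the number of observations, and the number of predictors. The sketch below uses the rounded figures printed in the summary (0.811, 2056 and 13), so it can only reproduce the reported value to about three decimals.

```python
# Adjusted R-squared recomputed from the figures printed in the OLS summary.
# The inputs are the rounded values shown above, so the result is approximate.
r_squared = 0.811   # "R-squared"
n_obs = 2056        # "No. Observations"
n_features = 13     # "Df Model" (predictors, excluding the constant)

adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)
print(round(adj_r_squared, 3))
```

The result agrees with the reported Adj. R-squared of 0.810. Because the penalty term (n - 1) / (n - p - 1) is very close to 1 with 2056 rows, R-squared and adjusted R-squared barely differ for any of the models in this section.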
In [41]:
vif = pd.DataFrame()
vif['Features'] = X_train_rfe2.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe2.values,i) for i in range(X_train_rfe2.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Out[41]:
Features VIF
12 Schooling 44.53
11 Income composition of resources 30.29
8 Diphtheria 29.80
6 Polio 26.22
4 Hepatitis B 18.76
5 BMI 8.19
7 Total expenditure 7.73
0 Status 7.06
1 Adult Mortality 4.38
3 Alcohol 4.23
10 thinness 1-19 years 4.07
9 HIV/AIDS 1.69
2 infant deaths 1.47
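
The VIF cells in this section repeat the same five lines and pass the feature matrix without a constant column. statsmodels' `variance_inflation_factor` runs each auxiliary regression on exactly the columns it is given, so without a constant the regressions have no intercept and the VIFs tend to come out larger than the textbook (centered) values. Below is a minimal sketch of a reusable helper that includes an intercept in every auxiliary regression. The function name `vif_table` and the toy data are illustrative assumptions and not part of the original notebook, and the helper uses plain NumPy instead of statsmodels.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.DataFrame:
    """Centered VIF per feature: regress each column on the others plus an intercept."""
    X = features.to_numpy(dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1 - residuals.var() / target.var()               # centered R-squared
        vifs.append(1.0 / (1.0 - r2))
    out = pd.DataFrame({'Features': features.columns, 'VIF': np.round(vifs, 2)})
    return out.sort_values(by='VIF', ascending=False)

# Toy data (hypothetical): 'b' is almost a copy of 'a', 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})
toy_vif = vif_table(toy)
print(toy_vif)
```

On the notebook's data this would be called as `vif_table(X_train_rfe2)`. The numbers will generally differ from the tables shown here because of the intercept, so any cut-off should be applied consistently to one variant or the other.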

Since the variable 'Alcohol' has a very high p-value (0.904), we eliminate it from the training dataset.

In [42]:
#Dropping the insignificant variable Alcohol
X_train_rfe3 = X_train_rfe2.drop('Alcohol', axis=1)

#adding constant
X_train_rfe3c = sm.add_constant(X_train_rfe3)

#model building
lm_rfe3 = sm.OLS(y_train, X_train_rfe3c).fit()

#summary
print(lm_rfe3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.811
Model:                            OLS   Adj. R-squared:                  0.810
Method:                 Least Squares   F-statistic:                     732.0
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:20   Log-Likelihood:                -5872.2
No. Observations:                2056   AIC:                         1.177e+04
Df Residuals:                    2043   BIC:                         1.184e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              55.1373      0.814     67.751      0.000      53.541      56.733
Status                             -2.0259      0.293     -6.904      0.000      -2.601      -1.450
Adult Mortality                    -0.0204      0.001    -20.439      0.000      -0.022      -0.018
infant deaths                      -0.0025      0.001     -2.826      0.005      -0.004      -0.001
Hepatitis B                        -0.0243      0.005     -4.887      0.000      -0.034      -0.015
BMI                                 0.0502      0.006      8.073      0.000       0.038       0.062
Polio                               0.0307      0.006      5.484      0.000       0.020       0.042
Total expenditure                   0.0819      0.042      1.928      0.054      -0.001       0.165
Diphtheria                          0.0536      0.006      9.043      0.000       0.042       0.065
HIV/AIDS                           -0.5149      0.025    -20.942      0.000      -0.563      -0.467
thinness  1-19 years               -0.0571      0.029     -1.949      0.051      -0.114       0.000
Income composition of resources     7.1646      0.784      9.144      0.000       5.628       8.701
Schooling                           0.7009      0.051     13.647      0.000       0.600       0.802
==============================================================================
Omnibus:                      110.297   Durbin-Watson:                   1.988
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              309.319
Skew:                          -0.251   Prob(JB):                     6.80e-68
Kurtosis:                       4.833   Cond. No.                     2.28e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.28e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [43]:
#VIF for all features
vif = pd.DataFrame()
vif['Features'] = X_train_rfe3.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe3.values,i) for i in range(X_train_rfe3.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Out[43]:
Features VIF
11 Schooling 42.11
10 Income composition of resources 30.28
7 Diphtheria 29.80
5 Polio 26.16
3 Hepatitis B 18.73
4 BMI 8.18
6 Total expenditure 7.49
0 Status 6.05
1 Adult Mortality 4.30
9 thinness 1-19 years 3.96
8 HIV/AIDS 1.69
2 infant deaths 1.45

'Schooling' has the highest VIF (42.11), so we drop it from the training set.

In [44]:
X_train_rfe4 = X_train_rfe3.drop('Schooling', axis=1)

#add a constant
X_train_rfe4c = sm.add_constant(X_train_rfe4)

#Model building
lm_rfe4 = sm.OLS(y_train, X_train_rfe4c).fit()

#Summary
print(lm_rfe4.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.794
Model:                            OLS   Adj. R-squared:                  0.793
Method:                 Least Squares   F-statistic:                     716.7
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:20   Log-Likelihood:                -5961.9
No. Observations:                2056   AIC:                         1.195e+04
Df Residuals:                    2044   BIC:                         1.202e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              58.7499      0.804     73.101      0.000      57.174      60.326
Status                             -2.6868      0.302     -8.889      0.000      -3.280      -2.094
Adult Mortality                    -0.0217      0.001    -20.948      0.000      -0.024      -0.020
infant deaths                      -0.0029      0.001     -3.092      0.002      -0.005      -0.001
Hepatitis B                        -0.0249      0.005     -4.801      0.000      -0.035      -0.015
BMI                                 0.0626      0.006      9.740      0.000       0.050       0.075
Polio                               0.0360      0.006      6.188      0.000       0.025       0.047
Total expenditure                   0.1155      0.044      2.608      0.009       0.029       0.202
Diphtheria                          0.0572      0.006      9.249      0.000       0.045       0.069
HIV/AIDS                           -0.4918      0.026    -19.200      0.000      -0.542      -0.442
thinness  1-19 years               -0.0830      0.031     -2.719      0.007      -0.143      -0.023
Income composition of resources    13.9929      0.630     22.220      0.000      12.758      15.228
==============================================================================
Omnibus:                       88.663   Durbin-Watson:                   2.009
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              262.468
Skew:                          -0.121   Prob(JB):                     1.01e-57
Kurtosis:                       4.734   Cond. No.                     2.27e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.27e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [45]:
vif = pd.DataFrame()
vif['Features'] = X_train_rfe4.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe4.values,i) for i in range(X_train_rfe4.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Out[45]:
Features VIF
7 Diphtheria 29.69
5 Polio 25.80
3 Hepatitis B 18.59
10 Income composition of resources 13.90
4 BMI 7.83
6 Total expenditure 7.22
0 Status 6.05
1 Adult Mortality 4.30
9 thinness 1-19 years 3.95
8 HIV/AIDS 1.69
2 infant deaths 1.45

'Diphtheria' now has the highest VIF (29.69), so we drop it from the training set.

In [46]:
X_train_rfe5 = X_train_rfe4.drop('Diphtheria', axis=1)

#adding constant
X_train_rfe5c = sm.add_constant(X_train_rfe5)

#building model
lm_rfe5 = sm.OLS(y_train, X_train_rfe5c).fit()

#summary
print(lm_rfe5.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.785
Model:                            OLS   Adj. R-squared:                  0.784
Method:                 Least Squares   F-statistic:                     748.8
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:20   Log-Likelihood:                -6004.1
No. Observations:                2056   AIC:                         1.203e+04
Df Residuals:                    2045   BIC:                         1.209e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              59.1919      0.819     72.302      0.000      57.586      60.797
Status                             -2.6778      0.308     -8.682      0.000      -3.283      -2.073
Adult Mortality                    -0.0221      0.001    -20.928      0.000      -0.024      -0.020
infant deaths                      -0.0031      0.001     -3.256      0.001      -0.005      -0.001
Hepatitis B                        -0.0102      0.005     -2.018      0.044      -0.020      -0.000
BMI                                 0.0645      0.007      9.838      0.000       0.052       0.077
Polio                               0.0649      0.005     12.954      0.000       0.055       0.075
Total expenditure                   0.1491      0.045      3.310      0.001       0.061       0.237
HIV/AIDS                           -0.4937      0.026    -18.889      0.000      -0.545      -0.442
thinness  1-19 years               -0.0823      0.031     -2.644      0.008      -0.143      -0.021
Income composition of resources    14.7697      0.637     23.190      0.000      13.521      16.019
==============================================================================
Omnibus:                       92.191   Durbin-Watson:                   2.014
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              273.609
Skew:                          -0.141   Prob(JB):                     3.86e-60
Kurtosis:                       4.765   Cond. No.                     2.17e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [47]:
vif = pd.DataFrame()
vif['Features'] = X_train_rfe5.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe5.values,i) for i in range(X_train_rfe5.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by ='VIF', ascending = False)
vif
Out[47]:
Features VIF
5 Polio 17.90
3 Hepatitis B 16.46
9 Income composition of resources 13.42
4 BMI 7.82
6 Total expenditure 7.12
0 Status 6.04
1 Adult Mortality 4.29
8 thinness 1-19 years 3.95
7 HIV/AIDS 1.69
2 infant deaths 1.45

'Polio' now has the highest VIF (17.90), so we drop it from the training set.

In [48]:
X_train_rfe6 = X_train_rfe5.drop('Polio', axis=1)

#adding constant
X_train_rfe6c = sm.add_constant(X_train_rfe6)

#building model
lm_rfe6 = sm.OLS(y_train, X_train_rfe6c).fit()

#summary
print(lm_rfe6.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.768
Model:                            OLS   Adj. R-squared:                  0.767
Method:                 Least Squares   F-statistic:                     752.1
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:20   Log-Likelihood:                -6085.2
No. Observations:                2056   AIC:                         1.219e+04
Df Residuals:                    2046   BIC:                         1.225e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              61.5258      0.831     74.081      0.000      59.897      63.155
Status                             -2.8026      0.321     -8.742      0.000      -3.431      -2.174
Adult Mortality                    -0.0232      0.001    -21.164      0.000      -0.025      -0.021
infant deaths                      -0.0038      0.001     -3.849      0.000      -0.006      -0.002
Hepatitis B                         0.0135      0.005      2.766      0.006       0.004       0.023
BMI                                 0.0703      0.007     10.336      0.000       0.057       0.084
Total expenditure                   0.1839      0.047      3.932      0.000       0.092       0.276
HIV/AIDS                           -0.4923      0.027    -18.113      0.000      -0.546      -0.439
thinness  1-19 years               -0.0744      0.032     -2.297      0.022      -0.138      -0.011
Income composition of resources    16.2560      0.652     24.951      0.000      14.978      17.534
==============================================================================
Omnibus:                      124.779   Durbin-Watson:                   1.992
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              401.157
Skew:                          -0.247   Prob(JB):                     7.76e-88
Kurtosis:                       5.107   Cond. No.                     2.06e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.06e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [49]:
vif = pd.DataFrame()
vif['Features'] = X_train_rfe6.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe6.values,i) for i in range(X_train_rfe6.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by ='VIF', ascending = False)
vif
Out[49]:
Features VIF
3 Hepatitis B 12.86
8 Income composition of resources 11.81
4 BMI 7.70
5 Total expenditure 6.94
0 Status 6.00
1 Adult Mortality 4.29
7 thinness 1-19 years 3.92
6 HIV/AIDS 1.69
2 infant deaths 1.45

'Hepatitis B' now has the highest VIF (12.86), so we drop it from the training set.

In [50]:
X_train_rfe7 = X_train_rfe6.drop('Hepatitis B', axis=1)

#adding constant
X_train_rfe7c = sm.add_constant(X_train_rfe7)

#building model
lm_rfe7 = sm.OLS(y_train, X_train_rfe7c).fit()

#summary
print(lm_rfe7.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.767
Model:                            OLS   Adj. R-squared:                  0.766
Method:                 Least Squares   F-statistic:                     842.4
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:21   Log-Likelihood:                -6089.0
No. Observations:                2056   AIC:                         1.220e+04
Df Residuals:                    2047   BIC:                         1.225e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              62.5312      0.748     83.601      0.000      61.064      63.998
Status                             -2.8044      0.321     -8.733      0.000      -3.434      -2.175
Adult Mortality                    -0.0233      0.001    -21.330      0.000      -0.025      -0.021
infant deaths                      -0.0043      0.001     -4.378      0.000      -0.006      -0.002
BMI                                 0.0712      0.007     10.456      0.000       0.058       0.085
Total expenditure                   0.1859      0.047      3.971      0.000       0.094       0.278
HIV/AIDS                           -0.4957      0.027    -18.225      0.000      -0.549      -0.442
thinness  1-19 years               -0.0701      0.032     -2.164      0.031      -0.134      -0.007
Income composition of resources    16.3814      0.651     25.164      0.000      15.105      17.658
==============================================================================
Omnibus:                      123.339   Durbin-Watson:                   1.995
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              400.688
Skew:                          -0.237   Prob(JB):                     9.81e-88
Kurtosis:                       5.110   Cond. No.                     1.90e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.9e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [51]:
vif = pd.DataFrame()
vif['Features'] = X_train_rfe7.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe7.values,i) for i in range(X_train_rfe7.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by ='VIF', ascending = False)
vif
Out[51]:
Features VIF
7 Income composition of resources 9.62
3 BMI 7.49
4 Total expenditure 6.53
0 Status 5.56
1 Adult Mortality 4.25
6 thinness 1-19 years 3.75
5 HIV/AIDS 1.68
2 infant deaths 1.41
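
After this last step every remaining VIF is below 10 (the largest is 9.62). The drop-refit-recompute cycle used above could be automated. The sketch below is one possible version under stated assumptions: the cut-off of 10 is an assumption (the notebook does not state its threshold), the function name `drop_high_vif` is hypothetical, and the VIFs are taken from the diagonal of the inverse correlation matrix, which gives centered VIFs and therefore will not match the tables above exactly. It also ignores p-values, which the manual procedure used when dropping 'Alcohol'.

```python
import numpy as np
import pandas as pd

def drop_high_vif(features: pd.DataFrame, threshold: float = 10.0):
    """Repeatedly drop the feature with the largest VIF until all are <= threshold."""
    kept = features.copy()
    dropped = []
    while kept.shape[1] > 1:
        corr = np.corrcoef(kept.to_numpy(dtype=float), rowvar=False)
        # Centered VIFs are the diagonal of the inverse correlation matrix.
        vif = pd.Series(np.diag(np.linalg.inv(corr)), index=kept.columns)
        if vif.max() <= threshold:
            break
        worst = vif.idxmax()
        dropped.append(worst)
        kept = kept.drop(columns=worst)
    return kept, dropped

# Toy data (hypothetical): 'b' is almost a copy of 'a', 'c' is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=300),
    'c': rng.normal(size=300),
})
toy_kept, toy_dropped = drop_high_vif(toy)
print(toy_dropped, list(toy_kept.columns))
```

Dropping one feature at a time matters because removing a single column can change every other VIF, as the tables above show after each step.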

Approach 3: Stepwise Regression

In [52]:
## By David Dale https://datascience.stackexchange.com/users/24162/david-dale
def stepwise_selection(X, y, 
                       initial_list=[], 
                       threshold_in=0.05, 
                       threshold_out = 0.10, 
                       verbose=True):
    """ Perform a forward-backward feature selection 
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features 
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    while True:
        changed=False
        # forward step
        excluded = list(set(X.columns)-set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed=True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max() # null if pvalues is empty
        if worst_pval > threshold_out:
            changed=True
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

result = stepwise_selection(X_train, y_train)

print('resulting features:')
print(result)
Add  Schooling                      with p-value 0.0
Add  Adult Mortality                with p-value 2.94814e-217
Add  HIV/AIDS                       with p-value 8.85176e-80
Add  Diphtheria                     with p-value 6.63636e-50
Add  BMI                            with p-value 6.41342e-29
Add  Income composition of resources with p-value 6.43838e-22
Add  Status                         with p-value 1.13954e-15
Add  percentage expenditure         with p-value 9.00493e-08
Add  Polio                          with p-value 5.68587e-07
Add  Measles                        with p-value 8.01425e-06
Add  Hepatitis B                    with p-value 9.17377e-06
Add  under-five deaths              with p-value 0.00233237
Add  infant deaths                  with p-value 5.69409e-21
Add  thinness  1-19 years           with p-value 0.00227501
resulting features:
['Schooling', 'Adult Mortality', 'HIV/AIDS', 'Diphtheria', 'BMI', 'Income composition of resources', 'Status', 'percentage expenditure', 'Polio', 'Measles', 'Hepatitis B', 'under-five deaths', 'infant deaths', 'thinness  1-19 years']
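
No feature was dropped in this run, so the backward step was never exercised. That step needs the label of the feature with the largest p-value, because `included` holds column names. In pandas 1.0 and later, `Series.argmax` returns an integer position while `Series.idxmax` returns the index label, so the two are not interchangeable there. A small illustration, with made-up p-values:

```python
import pandas as pd

# Illustrative p-values only; they are not taken from a fitted model.
pvalues = pd.Series({'Schooling': 0.001, 'Alcohol': 0.904, 'BMI': 0.020})

worst_label = pvalues.idxmax()      # index label of the largest p-value
worst_position = pvalues.argmax()   # integer position (pandas >= 1.0)

included = list(pvalues.index)
included.remove(worst_label)        # works: the list holds labels
print(worst_label, worst_position, included)
```

Calling `included.remove(worst_position)` instead would raise a `ValueError`, since the integer is not an element of the list of names.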
In [53]:
X_train_stepwise = X_train[['Schooling', 'Adult Mortality', 'HIV/AIDS', 'Diphtheria', 'BMI', 'Income composition of resources', 'Status', 'percentage expenditure', 'Polio', 'Measles', 'Hepatitis B', 'under-five deaths', 'infant deaths', 'thinness  1-19 years']]
# Adding constant 
X_train_stepwise = sm.add_constant(X_train_stepwise)
# Building a model
lm_stepwise = sm.OLS(y_train, X_train_stepwise).fit()
# Summary
print(lm_stepwise.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Life expectancy   R-squared:                       0.823
Model:                            OLS   Adj. R-squared:                  0.822
Method:                 Least Squares   F-statistic:                     677.3
Date:                Tue, 22 Sep 2020   Prob (F-statistic):               0.00
Time:                        22:18:22   Log-Likelihood:                -5807.2
No. Observations:                2056   AIC:                         1.164e+04
Df Residuals:                    2041   BIC:                         1.173e+04
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              56.8039      0.756     75.122      0.000      55.321      58.287
Schooling                           0.6985      0.050     14.021      0.000       0.601       0.796
Adult Mortality                    -0.0197      0.001    -20.365      0.000      -0.022      -0.018
HIV/AIDS                           -0.4943      0.024    -20.731      0.000      -0.541      -0.448
Diphtheria                          0.0457      0.006      7.901      0.000       0.034       0.057
BMI                                 0.0491      0.006      8.134      0.000       0.037       0.061
Income composition of resources     5.8765      0.763      7.704      0.000       4.380       7.372
Status                             -1.8727      0.291     -6.433      0.000      -2.444      -1.302
percentage expenditure              0.0003   5.07e-05      5.555      0.000       0.000       0.000
Polio                               0.0272      0.005      5.004      0.000       0.017       0.038
Measles                         -2.367e-05   9.35e-06     -2.530      0.011    -4.2e-05   -5.32e-06
Hepatitis B                        -0.0194      0.005     -4.010      0.000      -0.029      -0.010
under-five deaths                  -0.0701      0.007     -9.950      0.000      -0.084      -0.056
infant deaths                       0.0940      0.010      9.797      0.000       0.075       0.113
thinness  1-19 years               -0.0868      0.028     -3.056      0.002      -0.143      -0.031
==============================================================================
Omnibus:                      106.745   Durbin-Watson:                   1.971
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              308.003
Skew:                          -0.225   Prob(JB):                     1.31e-67
Kurtosis:                       4.842   Cond. No.                     1.05e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.05e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
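
The condition number here (1.05e+05) is much larger than in the RFE-based models. One likely contributor, not verified against this dataset, is feature scale: 'percentage expenditure' and 'Measles' have very small coefficients, which suggests they take much larger numeric values than the other predictors. The sketch below uses synthetic data to show how mixing scales inflates the condition number of a design matrix and how standardizing the columns reduces it. All names and values in it are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
small = rng.normal(loc=5.0, scale=1.0, size=n)         # stands in for a rate or index
large = rng.normal(loc=2000.0, scale=1000.0, size=n)   # stands in for a raw count

def with_const(*cols):
    """Stack an intercept column in front of the given feature columns."""
    return np.column_stack([np.ones(n), *cols])

def standardize(x):
    """Rescale a column to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

cond_raw = np.linalg.cond(with_const(small, large))
cond_scaled = np.linalg.cond(with_const(standardize(small), standardize(large)))
print(cond_raw, cond_scaled)
```

Standardizing only removes the part of the problem caused by scale. Genuinely correlated predictors, such as 'infant deaths' and 'under-five deaths', which enter this model with opposite signs, would still need to be handled by dropping or combining features.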

Model Prediction and Evaluation

In [56]:
# Prediction on the test set

X_test_stepwise = X_test[['Schooling', 'Adult Mortality', 'HIV/AIDS', 'Diphtheria', 'BMI', 'Income composition of resources', 'Status', 'percentage expenditure', 'Polio', 'Measles', 'Hepatitis B', 'under-five deaths', 'infant deaths', 'thinness  1-19 years']]
X_test_stepwise = sm.add_constant(X_test_stepwise)
actual = y_test['Life expectancy']
prediction = lm_stepwise.predict(X_test_stepwise)
In [57]:
# Model Evaluation: calculating the Mean Squared Error
model_mse = mean_squared_error(actual, prediction)
print(model_mse)
15.972714682411294
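
The MSE is expressed in squared years, which is hard to interpret directly. Taking the square root gives the root mean squared error (RMSE) in the same unit as life expectancy. A minimal sketch using the value printed above:

```python
import math

model_mse = 15.972714682411294      # test-set MSE printed above
model_rmse = math.sqrt(model_mse)   # back in years of life expectancy
print(round(model_rmse, 2))
```

The RMSE comes out at roughly 4 years, so a typical prediction on the test set is off by about that much, with large errors weighted more heavily than small ones.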
In [58]:
# Model Evaluation: calculating the Mean Absolute Percentage Error (MAPE)
def mape(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) *100
In [59]:
mape(actual,prediction)
Out[59]:
4.558248666207472
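
To make the MAPE figure concrete, here is the same formula applied to two hand-checkable points. The toy values are hypothetical and only illustrate the arithmetic: the relative errors are 5/50 = 10% and 4/80 = 5%, so the mean is 7.5%.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.array(y_true, dtype=float)
    y_pred = np.array(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

toy_actual = [50.0, 80.0]      # hypothetical life expectancies
toy_predicted = [55.0, 76.0]   # hypothetical predictions
toy_mape = mape(toy_actual, toy_predicted)
print(round(toy_mape, 2))
```

Read the same way, the test-set value of about 4.56 means predictions deviate from the actual life expectancy by roughly 4.6% on average. MAPE is undefined when an actual value is zero, which is not a concern for life expectancy.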
In [60]:
# Linearity check
sns.scatterplot(x=y_test['Life expectancy'], y=prediction)
plt.title('Linearity Check')
plt.xlabel('Actual value')
plt.ylabel('Predicted value')
Out[60]:
Text(0, 0.5, 'Predicted value')
In [61]:
# Histogram to analyse error terms
fig = plt.figure()
sns.histplot(y_test['Life expectancy'] - prediction, bins=20, kde=True)
fig.suptitle('Error Term Analysis', fontsize = 20)                   
plt.xlabel('Errors', fontsize = 18)
Out[61]:
Text(0.5, 0, 'Errors')